SIMD instructions in Java
With the release of Java 16, it is now possible to use SIMD (Single Instruction, Multiple Data) instructions to harness the full power of the CPU. This is made
possible by the Vector API.
Because the API is still an incubator (non-final) feature, the following argument has to be passed to both the compiler and the VM to use it: --add-modules jdk.incubator.vector
Terminology - shapes and species
The shape of a vector is its size in bits (e.g. 256). A species is a combination of an element type (int, float, ...) and a shape.
The optimal shape (size in bits), and thus the optimal species of a vector, varies from CPU to CPU, which is why each subclass of Vector has a static SPECIES_PREFERRED field that holds
the optimal species. The field IntVector.SPECIES_PREFERRED, for example, holds the optimal species for processing integers on the currently running Java platform.
Finally, the number of lanes in a vector is the number of its elements. For instance, a DoubleVector with a shape of 256 bits has 4 lanes (because 4 * 64 bits = 256 bits).
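To make the terminology concrete, the following sketch prints the element type, shape, and lane count of a few species. The class name SpeciesInfo and its helper methods are made up for illustration; the program has to be compiled and run with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class, not part of the Vector API
public class SpeciesInfo {

    // lanes = shape in bits / size of one element in bits
    static int expectedLanes(int shapeBits, int elementBits) {
        return shapeBits / elementBits;
    }

    // summarize a species as "<element type> x <lanes> (<shape> bits)"
    static String describe(VectorSpecies<?> species) {
        return species.elementType().getSimpleName()
                + " x " + species.length()
                + " (" + species.vectorBitSize() + " bits)";
    }

    public static void main(String[] args) {
        // fixed 256-bit species: the lane count only depends on the element size
        System.out.println(describe(DoubleVector.SPECIES_256));
        System.out.println(describe(FloatVector.SPECIES_256));
        System.out.println(expectedLanes(256, Double.SIZE) + " double lanes expected");

        // preferred species: shape and lane count depend on the CPU
        System.out.println(describe(FloatVector.SPECIES_PREFERRED));
    }
}
```

The last line of output is the one that can differ between machines, since it reflects the shape the platform prefers.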
Example: Adding a number to each element of an array
One simple use case for SIMD instructions is adding one number to each element of an array:
import java.util.Arrays;
import jdk.incubator.vector.*;
...
//get the preferred species for float vectors
final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
float[] array = new float[29];
Arrays.fill(array, 5.0f);
float[] result = new float[array.length];
float addValue = 10.0f;
/*instead of adding 10.0f, we could also add a vector that has the value
10.0f for each element:
FloatVector addValue = FloatVector.broadcast(SPECIES, 10.0f);*/
int i = 0; //declare i here so we can use it after the first for loop
//loopBound rounds array.length down to a multiple of the lane count
for(; i < SPECIES.loopBound(array.length); i += SPECIES.length()) {
    //get a float vector from the array at index i
    FloatVector vec = FloatVector.fromArray(SPECIES, array, i);
    //add 10.0 to each element of the vector
    FloatVector sum = vec.add(addValue);
    //put the result in the result array at index i
    sum.intoArray(result, i);
}
//process the rest of the array (fewer elements than one full vector)
for(; i < array.length; i++) {
    result[i] = array[i] + addValue;
}
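The broadcast variant mentioned in the comment above could look roughly like this when wrapped in a method. This is only a sketch: the class name BroadcastAdd and the method addToAll are invented for this example, and the scalar value is kept alongside the broadcast vector because the remainder loop still needs it.

```java
import java.util.Arrays;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example class, not part of the Vector API
public class BroadcastAdd {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float[] addToAll(float[] array, float value) {
        float[] result = new float[array.length];
        //a vector that has the same value in every lane
        FloatVector addVec = FloatVector.broadcast(SPECIES, value);
        int bound = SPECIES.loopBound(array.length);
        int i = 0;
        for (; i < bound; i += SPECIES.length()) {
            //load, add lane by lane, store
            FloatVector.fromArray(SPECIES, array, i)
                    .add(addVec)
                    .intoArray(result, i);
        }
        //process the rest of the array with the scalar value
        for (; i < array.length; i++) {
            result[i] = array[i] + value;
        }
        return result;
    }

    public static void main(String[] args) {
        float[] array = new float[29];
        Arrays.fill(array, 5.0f);
        System.out.println(Arrays.toString(addToAll(array, 10.0f)));
    }
}
```

Creating the broadcast vector once outside the loop keeps the loop body down to a load, an add, and a store.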
Using Masks
Most operations also accept a mask. A VectorMask<E> has a boolean value for each lane. An operation will only be executed on the lanes on which
the mask is true. For example, we could multiply each value of an array by 2 but only if the number is less than 50:
//get the preferred species for int vectors
final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
//fill array with random numbers (SplittableRandom is in java.util)
SplittableRandom random = new SplittableRandom();
int[] array = new int[42];
for(int i = 0; i < array.length; i++) {
    array[i] = random.nextInt(100);
}
int[] result = new int[array.length];
int i = 0;
for(; i < SPECIES.loopBound(array.length); i += SPECIES.length()) {
    //get an int vector from the array at index i
    IntVector vec = IntVector.fromArray(SPECIES, array, i);
    //get the mask that is true for each lane where the value of vec is less
    //than 50
    VectorMask<Integer> mask = vec.lt(50);
    //multiply by 2 (only on the lanes where the mask is true; the other
    //lanes keep their original value) and write the result to the array
    vec.mul(2, mask).intoArray(result, i);
}
//process the rest of the array
for(; i < array.length; i++) {
    int num = array[i];
    result[i] = (num < 50) ? (num * 2) : num;
}
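Masks can also replace the scalar loop for the rest of the array. The sketch below builds a mask with SPECIES.indexInRange that is only true for lanes that still lie inside the array, and uses it for both the load and the store. The class name MaskedTail and the method doubleIfBelow50 are invented for this example, and whether this is faster than a scalar remainder loop likely depends on the CPU.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example class, not part of the Vector API
public class MaskedTail {
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static int[] doubleIfBelow50(int[] array) {
        int[] result = new int[array.length];
        for (int i = 0; i < array.length; i += SPECIES.length()) {
            //true for each lane whose index (i + lane) is below array.length
            VectorMask<Integer> inRange = SPECIES.indexInRange(i, array.length);
            //masked load: lanes outside the array are filled with 0
            IntVector vec = IntVector.fromArray(SPECIES, array, i, inRange);
            //lanes with a value less than 50
            VectorMask<Integer> small = vec.lt(50);
            //masked store: lanes outside the array are not written
            vec.mul(2, small).intoArray(result, i, inRange);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] array = {10, 49, 50, 99, 0, 75, 25};
        System.out.println(java.util.Arrays.toString(doubleIfBelow50(array)));
    }
}
```

Only the final iteration has a partially set mask; in all earlier iterations every lane is in range, so the masked load and store behave like the unmasked ones.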
Be aware that the JVM may already vectorize simple loops on its own, so implementing your own vectorization might not bring any performance increase. Additionally, the performance increase can differ greatly from CPU to CPU. For example, a CPU that supports the AVX-512 instruction set can process up to 512 bits per instruction, which is twice as much as with AVX2 (256 bits).
Sources
| [1] | Oracle - JDK 21 Docs |
| [2] | Baeldung |